1.
Sci Rep; 14(1): 1550, 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-38233494

ABSTRACT

One of the fundamental computational problems in cancer genomics is the identification of single nucleotide variants (SNVs) from DNA sequencing data. Many statistical models and software implementations for SNV calling have been developed in the literature, yet they still disagree widely on real datasets. Based on an empirical Bayesian approach, we introduce a local false discovery rate (LFDR) estimator for germline SNV calling. Our approach learns model parameters without prior information and simultaneously accounts for information across all sites in the genomic regions of interest. We also propose another LFDR-based algorithm that reliably prioritizes a given list of mutations called by any other variant-calling algorithm. We use a suite of gold-standard cell line data to compare our LFDR approach against a collection of widely used, state-of-the-art programs. We find that our LFDR approach approximately matches or exceeds the performance of all of these programs, despite some very large differences among them. Furthermore, when prioritizing other algorithms' calls by our LFDR score, we find that by manipulating the type I/type II error tradeoff we can select subsets of variant calls with minimal loss of sensitivity but dramatic increases in precision.
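As an illustration of the two-group model behind LFDR-based calling, here is a minimal Python sketch: the LFDR at a site is pi0 * f0(z) / f(z), where f0 is the null density and f is the mixture density learned from all sites. The z-score model, the crude pi0 estimator, and the 0.2 decision threshold are illustrative assumptions, not the paper's actual procedure.

```python
# Minimal two-group LFDR sketch (illustrative; not the paper's model).
import numpy as np
from scipy.stats import norm, gaussian_kde

def lfdr_estimates(z, pi0=None):
    """Estimate LFDR(z) = pi0 * f0(z) / f(z) for per-site z-scores."""
    z = np.asarray(z, dtype=float)
    f = gaussian_kde(z)               # mixture density, learned from all sites
    f0 = norm.pdf(z)                  # theoretical null density N(0, 1)
    if pi0 is None:
        # Crude null-proportion estimate: fraction of |z| < 1,
        # rescaled by the null mass on that interval.
        pi0 = min(1.0, np.mean(np.abs(z) < 1.0) / (norm.cdf(1) - norm.cdf(-1)))
    return np.clip(pi0 * f0 / f(z), 0.0, 1.0)

# Synthetic example: 10,000 null sites plus 500 shifted "variant" sites.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 1, 10_000), rng.normal(4, 1, 500)])
lfdr = lfdr_estimates(z)
print("called variants:", np.sum(lfdr < 0.2))
```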


Subject(s)
Nucleotides; Polymorphism, Single Nucleotide; Bayes Theorem; Nucleotides/genetics; Software; Algorithms; High-Throughput Nucleotide Sequencing
2.
Bioinform Adv; 2(1): vbac049, 2022.
Article in English | MEDLINE | ID: mdl-36699374

ABSTRACT

Motivation: Rapid developments in single-cell transcriptomic technology have led to increasing interest in cellular heterogeneity within cell populations. Although cell-type proportions can be obtained directly from single-cell RNA sequencing (scRNA-seq), it is costly and not feasible in every study. Alternatively, with fewer experimental complications, cell-type compositions can be characterized from bulk RNA-seq data. Many computational tools have been developed and reported in the literature; however, they fail to appropriately incorporate the covariance structures of both the scRNA-seq and bulk RNA-seq datasets in use. Results: We present a covariance-based single-cell decomposition (CSCD) method that estimates cell-type proportions in bulk data by building a reference expression profile from single-cell data and learning gene-specific bulk expression transformations using a constrained linear inverse model. The approach is similar to Bisque, a recently developed cell-type decomposition method. Bisque is limited to a univariate model and is thus unable to incorporate gene-gene correlations into the analysis. We introduce a more advanced model that incorporates the covariance structures of both the scRNA-seq and bulk RNA-seq datasets into the analysis and addresses the collinearity issue through linear shrinkage estimation of the corresponding covariance matrices. We applied CSCD to several publicly available datasets and measured the performance of CSCD, Bisque and six other common methods from the literature. Our results indicate that CSCD is more accurate and comprehensive than most of the existing methods. Availability and implementation: The R package is available at https://github.com/empiricalbayes/CSCDRNA.
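The constrained linear inverse model at the core of such decomposition methods can be sketched as non-negative least squares followed by renormalization: bulk expression b is modeled as S @ p with p >= 0 and sum(p) = 1. This toy example omits CSCD's covariance modeling and shrinkage entirely; the variable names and synthetic data are illustrative.

```python
# Generic cell-type deconvolution sketch (not CSCD's covariance model).
import numpy as np
from scipy.optimize import nnls

def deconvolve(S, b):
    """S: genes x cell-types signature matrix; b: bulk expression vector."""
    p, _ = nnls(S, b)        # non-negative least squares fit
    return p / p.sum()       # renormalize to proportions summing to 1

rng = np.random.default_rng(1)
S = rng.gamma(2.0, 1.0, size=(500, 4))        # synthetic reference profiles
p_true = np.array([0.4, 0.3, 0.2, 0.1])
b = S @ p_true + rng.normal(0, 0.05, 500)     # noisy synthetic bulk sample
print(np.round(deconvolve(S, b), 2))          # should be close to p_true
```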

3.
BMC Med Genomics; 13(1): 156, 2020 Oct 15.
Article in English | MEDLINE | ID: mdl-33059707

ABSTRACT

BACKGROUND: Treating cancer depends in part on identifying the mutations driving each patient's disease. Many clinical laboratories are adopting high-throughput sequencing for assaying patients' tumours, applying targeted panels to formalin-fixed paraffin-embedded tumour tissues to detect clinically relevant mutations. While there have been some benchmarking and best-practices studies of this scenario, much variant-calling work focuses on whole-genome or whole-exome studies with fresh or fresh-frozen tissue. Thus, definitive guidance on best choices for sequencing platforms, sequencing strategies, and variant calling for clinical variant detection is still being developed. METHODS: Because ground truth for clinical specimens is rarely known, we used the well-characterized Coriell cell lines GM12878 and GM12877 to generate data. We prepared samples to mimic clinical biopsies as closely as possible, including formalin fixation and paraffin embedding. We evaluated two well-known targeted sequencing panels: Illumina's TruSight 170 hybrid-capture panel and the amplification-based Oncomine Focus panel. Sequencing was performed on an Illumina NextSeq500 and an Ion Torrent PGM, respectively. We performed multiple replicates of each assay to test reproducibility. Finally, we applied four different freely available somatic single-nucleotide variant (SNV) callers to the data, along with the vendor-recommended callers for each sequencing platform. RESULTS: We did not observe major differences in variant-calling success within the regions that each panel covers, but there were substantial differences between callers. All had high sensitivity for true SNVs but produced numerous, non-overlapping false positives. Overriding certain default parameters to make them consistent between callers substantially reduced discrepancies but still resulted in high false positive rates. Intersecting results from multiple replicates or from different variant callers eliminated most false positives while maintaining sensitivity. CONCLUSIONS: Reproducibility and accuracy of targeted clinical sequencing results depend less on sequencing platform and panel than on variability between replicates and downstream bioinformatics. Differences in variant callers' default parameters exert a greater influence on algorithm disagreement than other differences between the algorithms. Contrary to typical clinical practice, we recommend employing multiple variant-calling pipelines and/or analyzing replicate samples, as this greatly decreases false positive calls.
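The recommended intersection strategy reduces, in its simplest form, to keeping only the variants reported by every caller or every replicate. A minimal Python sketch, with hypothetical call sets keyed by (chromosome, position, ref, alt):

```python
# Consensus of variant calls by set intersection (illustrative call sets).
from functools import reduce

caller_a = {("chr1", 12345, "A", "G"), ("chr2", 500, "C", "T"), ("chr3", 42, "G", "A")}
caller_b = {("chr1", 12345, "A", "G"), ("chr3", 42, "G", "A"), ("chr7", 99, "T", "C")}
caller_c = {("chr1", 12345, "A", "G"), ("chr3", 42, "G", "A")}

# Keep only calls supported by all three callers.
consensus = reduce(set.intersection, [caller_a, caller_b, caller_c])
print(sorted(consensus))
```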


Subject(s)
Algorithms; Biomarkers, Tumor/genetics; DNA Mutational Analysis/methods; Mutation; Neoplasms/genetics; Neoplasms/pathology; Polymorphism, Single Nucleotide; Computational Biology; Formaldehyde; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; Humans; Paraffin Embedding; Reproducibility of Results; Tumor Cells, Cultured
4.
Article in English | MEDLINE | ID: mdl-30113898

ABSTRACT

In a genome-wide association study (GWAS), the probability that a single nucleotide polymorphism (SNP) is not associated with a disease is its local false discovery rate (LFDR). The LFDR for each SNP is relative to a reference class of SNPs. For example, the LFDR of an exonic SNP can vary widely depending on whether it is considered relative to the separate reference class of other exonic SNPs or relative to the combined reference class of all SNPs in the data set. As a result, an analysis based on the combined reference class might indicate that a specific exonic SNP is associated with the disease, while an analysis using the separate reference class indicates that it is not, or vice versa. To address this, we introduce empirical Bayes methods that simultaneously consider a combined reference class and a separate reference class. Our simulation studies indicate that the proposed methods lead to improved performance. The new maximum entropy method achieves this by relying on the separate class when it has enough SNPs for reliable LFDR estimation and solely on the combined class otherwise. We used the new methods to analyze data from a GWAS of 2,000 cases and 3,000 controls. R functions implementing the proposed methods are available on CRAN and Shiny.
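The fallback logic between separate and combined reference classes can be illustrated with a toy Python sketch: estimate the null proportion within the separate class when it is large enough, otherwise use the combined class. The size threshold and the Storey-style pi0 estimator are assumptions for illustration, not the paper's maximum entropy procedure.

```python
# Reference-class fallback sketch (illustrative; not the ME method itself).
import numpy as np

def pi0_estimate(p):
    # Storey-style estimate with lambda = 0.5: under the null, p-values are
    # uniform, so twice the fraction above 0.5 approximates the null proportion.
    return min(1.0, 2.0 * np.mean(np.asarray(p) > 0.5))

def class_pi0(p_separate, p_combined, min_size=200):
    # Use the separate class only when it has enough SNPs for a stable estimate.
    return pi0_estimate(p_separate) if len(p_separate) >= min_size else pi0_estimate(p_combined)

rng = np.random.default_rng(2)
p_all = rng.uniform(size=50_000)         # synthetic GWAS p-values (mostly null)
p_exonic = rng.uniform(size=50) ** 2     # small, signal-enriched separate class
print(class_pi0(p_exonic, p_all))        # too few exonic SNPs: combined class used
```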


Subject(s)
Computational Biology/methods; Genome-Wide Association Study/methods; Polymorphism, Single Nucleotide/genetics; Bayes Theorem; Coronary Artery Disease/genetics; Databases, Genetic; Entropy; Humans
5.
J Theor Biol; 459: 119-129, 2018 Dec 14.
Article in English | MEDLINE | ID: mdl-30266462

ABSTRACT

In this paper, we assume that allele frequencies are random variables that follow certain statistical distributions. However, specifying an appropriate informative prior distribution with specific hyperparameters is a major issue. Assuming that prior information varies over some class of priors, we develop robust Bayes estimation in the context of allele frequency estimation. We first assume that the region of interest is a single locus and that the prior information is represented by a class of Beta distributions, and we present explicit forms of the resulting Bayes and robust Bayes estimators. We then extend our results to biallelic k-loci and multi-allelic k-loci cases within the region of interest. We perform a simulation study to measure the performance of the proposed robust Bayes estimators against Bayes estimators associated with specific hyperparameters. The simulations show satisfactory performance of the proposed robust Bayes estimators when there is no evidence indicating the actual prior distribution.
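For intuition, under a Beta(a, b) prior the posterior mean of an allele frequency from x allele copies in n observations is (x + a) / (n + a + b); a robust Bayes analysis considers how the estimate varies as the prior ranges over a class. The sketch below merely scans a small hypothetical class of Beta priors and is not the paper's derived estimator.

```python
# Spread of Bayes estimates over a class of Beta priors (illustrative).
import itertools

def posterior_mean(x, n, a, b):
    # Beta(a, b) prior + binomial likelihood gives Beta(x + a, n - x + b)
    # posterior, whose mean is (x + a) / (n + a + b).
    return (x + a) / (n + a + b)

x, n = 18, 100                                   # synthetic allele counts
prior_class = list(itertools.product([0.5, 1, 2], repeat=2))
means = [posterior_mean(x, n, a, b) for a, b in prior_class]
print(min(means), max(means))                    # range of estimates over the class
```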


Subject(s)
Bayes Theorem; Gene Frequency; Statistics as Topic/methods; Computer Simulation; Decision Support Techniques; Genetic Loci; Humans; Statistical Distributions
6.
Front Psychol; 9: 699, 2018.
Article in English | MEDLINE | ID: mdl-29867666

ABSTRACT

We argue that making accept/reject decisions on scientific hypotheses, including a recent call for changing the canonical alpha level from p = 0.05 to p = 0.005, is detrimental to new discoveries and to the progress of science. Given that both blanket and variable alpha levels are problematic, it is sensible to dispense with significance testing altogether. There are alternatives that address study design and sample size much more directly than significance testing does, but none of these statistical tools should be taken as a new magic method giving clear-cut mechanical answers. Inference should not be based on single studies at all, but on cumulative evidence from multiple independent studies. When evaluating the strength of the evidence, we should consider, for example, auxiliary assumptions, the strength of the experimental design, and implications for applications. To boil all this down to a binary decision based on a p-value threshold of 0.05, 0.01, 0.005, or anything else is not acceptable.

7.
PLoS One; 12(9): e0185174, 2017.
Article in English | MEDLINE | ID: mdl-28931044

ABSTRACT

The maximum entropy (ME) method is a recently developed approach for estimating local false discovery rates (LFDRs) that incorporates external information, allowing a subset of tests to be assigned to a category with a different prior probability of following the null hypothesis. Using this ME method, we reanalyzed the findings from a recent large genome-wide association study of coronary artery disease (CAD), incorporating biological annotations. Our revised estimates show many large reductions in LFDR, particularly among genetic variants belonging to annotation categories known to be of particular interest for CAD. However, among SNPs with rare minor allele frequencies, the reductions in LFDR were modest in size.
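The category-specific prior is what drives these LFDR reductions: for the same test statistic, a smaller prior null probability in an enriched annotation category yields a smaller LFDR. A minimal two-group sketch with synthetic densities and pi0 values, none taken from the paper:

```python
# Annotation-aware LFDR sketch (synthetic two-group model; illustrative only).
from scipy.stats import norm

def lfdr(z, pi0, alt_mean=3.0):
    f0 = norm.pdf(z)                    # null density N(0, 1)
    f1 = norm.pdf(z, loc=alt_mean)      # assumed alternative density
    f = pi0 * f0 + (1 - pi0) * f1       # two-group mixture density
    return pi0 * f0 / f

z = 2.5                                 # test statistic for one SNP
print(lfdr(z, pi0=0.99))                # generic reference class: LFDR ~ 0.83
print(lfdr(z, pi0=0.90))                # enriched annotation category: LFDR ~ 0.31
```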


Subject(s)
Coronary Artery Disease/genetics; Gene Frequency; Genome-Wide Association Study/methods; Polymorphism, Single Nucleotide; Genetic Predisposition to Disease; Genome-Wide Association Study/statistics & numerical data; Humans; Models, Genetic; Probability